(57) Abstract: The present invention generally relates to the field of noise reduction systems which
are equipped with an audio-visual user interface, in particular to an audio-visual speech activity
recognition system (200b/c) of a video-enabled telecommunication device which runs a real-time lip
tracking application that can advantageously be used for a near-speaker detection algorithm in an
environment where a speaker's voice is interfered by a statistically distributed background noise
(n'(t)) including both environmental noise (n(t)) and surrounding persons' voices.

Noise Reduction and Audio-Visual Speech Activity Detection

FIELD AND BACKGROUND OF THE INVENTION 

The present invention generally relates to the field of noise reduction based on speech activity
recognition, in particular to an audio-visual user interface of a telecommunication device running
an application that can advantageously be used e.g. for a near-speaker detection algorithm in an
environment where a speaker's voice is interfered by a statistically distributed background noise
including environmental noise as well as surrounding persons' voices.

Discontinuous transmission of speech signals based on speech/pause detection represents a valid
solution to improve the spectral efficiency of new-generation wireless communication systems. In
this context, robust voice activity detection algorithms are required, as conventional solutions
according to the state of the art present a high misclassification rate in the presence of the
background noise typical of mobile environments.

A voice activity detector (VAD) aims to distinguish between a speech signal and several types of
acoustic background noise even with low signal-to-noise ratios (SNRs). Therefore, in a typical
telephone conversation, such a VAD, together with a comfort noise generator (CNG), is used to
achieve silence compression. In the field of multimedia communications, silence compression allows
a speech channel to be shared with other types of information, thus guaranteeing simultaneous voice
and data applications. In cellular radio systems which are based on the Discontinuous Transmission
(DTX) mode, such as GSM, VADs are applied to reduce co-channel interference and power consumption
of the portable equipment. Furthermore, a VAD is vital to reduce the average data bit rate in
future generations of digital cellular networks such as the UMTS, which provide for a variable
bit-rate (VBR) speech coding. Most of the capacity gain is due to the distinction between speech
activity and inactivity. The performance of a speech coding approach which is based on phonetic
classification, however, strongly depends on the classifier, which must be robust to every type of
background noise. As is well known, the performance of a VAD is critical for the overall speech
quality, in particular with low SNRs. In case speech frames are detected as noise, intelligibility
is seriously impaired owing to speech clipping in the conversation. If, on the other hand, the
percentage of noise detected as speech is high, the potential advantages of silence compression are
not obtained. In the presence of background noise it may be difficult to distinguish between speech
and silence. Hence, for voice activity detection in wireless environments more efficient algorithms
are needed.

Although the Fuzzy Voice Activity Detector (FVAD) proposed in "Improved VAD G.729 Annex B for
Mobile Communications Using Soft Computing" (Contribution ITU-T, Study Group 16, Question 19/16,
Washington, September 2-5, 1997) by F. Beritelli, S. Casale, and A. Cavallaro performs better than
other solutions presented in literature, it exhibits an activity increase, above all in the
presence of non-stationary noise. The functional scheme of the FVAD is based on a traditional
pattern recognition approach wherein the four differential parameters used for speech
activity/inactivity classification are the full-band energy difference, the low-band energy
difference, the zero-crossing difference, and the spectral distortion. The matching phase is
performed by a set of fuzzy rules obtained automatically by means of a new hybrid learning tool as
described in "FuGeNeSys: Fuzzy Genetic Neural System for Fuzzy Modeling" by M. Russo (to appear in
IEEE Transactions on Fuzzy Systems). As is well known, a fuzzy system allows a gradual, continuous
transition rather than a sharp change between two values. Thus, the Fuzzy VAD returns a continuous
output signal ranging from 0 (non-activity) to 1 (activity), which does not depend on whether
single input signals have exceeded a predefined threshold or not, but on an overall evaluation of
the values they have assumed ("defuzzification process"). The final decision is made by comparing
the output of the fuzzy system, which varies in a range between 0 and 1, with a fixed threshold
experimentally chosen as described in "Voice Control of the Pan-European Digital Mobile Radio
System" (ICC '89, pp. 1070-1074) by C. B. Southcott et al.

Just like voice activity detectors, conventional automatic speech recognition (ASR) systems also
experience difficulties when being operated in noisy environments, since the accuracy of
conventional ASR algorithms largely decreases in noisy environments. When a speaker is talking in a
noisy environment including both ambient noise as well as surrounding persons' interfering voices,
a microphone picks up not only the speaker's voice but also these background sounds. Consequently,
an audio signal which encompasses the speaker's voice superimposed by said background sounds is
processed. The louder the interfering sounds, the more the acoustic comprehensibility of the
speaker is reduced. To overcome this problem, noise reduction circuitries are applied that make use
of the different frequency regions of environmental noise and the respective speaker's voice.

A typical noise reduction circuitry for a telephony-based application based on a speech activity
estimation algorithm according to the state of the art that implements a method for correlating the
discrete signal spectrum $S(k\cdot\Delta f)$ of an analog-to-digital-converted audio signal s(t)
with an audio speech activity estimate is shown in Fig. 2a. Said audio speech activity estimate is
obtained by an amplitude detection of the digital audio signal s(nT). The circuit outputs a
noise-reduced audio signal $\hat{s}_i(nT)$, which is calculated by subjecting the difference of the
discrete signal spectrum $S(k\cdot\Delta f)$ and a sampled version
$\tilde{\Phi}_{nn}(k\cdot\Delta f)$ of the estimated noise power density spectrum
$\tilde{\Phi}_{nn}(f)$ of a statistically distributed background noise n(t) to an Inverse Fast
Fourier Transform (IFFT).
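
Written as a formula, the conventional processing chain of Fig. 2a described above may be summarized as follows. This is only a compact restatement under assumptions: the hat and tilde marks for the output signal and the noise estimate are notational choices made here, and details such as how negative differences or the signal phase are handled are not stated in the text and are therefore left out.

```latex
S(k\cdot\Delta f) = \mathrm{FFT}\{\, s(nT) \,\}, \qquad
\hat{s}_i(nT) = \mathrm{IFFT}\{\, S(k\cdot\Delta f) - \tilde{\Phi}_{nn}(k\cdot\Delta f) \,\}
```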

BRIEF DESCRIPTION OF THE STATE OF THE ART

The invention described in US 5,313,522 refers to a device for facilitating comprehension by a
hearing-impaired person participating in a telephone conversation, which comprises a circuitry for
converting received audio speech signals into a series of phonemes and an arrangement for coupling
the circuitry to a POTS line. The circuit thereby includes an arrangement which correlates the
detected series of phonemes with recorded lip movements of a speaker and displays these lip
movements in subsequent images on a display device, thereby permitting the hearing-impaired person
to carry out a lipreading procedure while listening to the telephone conversation, which improves
the person's level of comprehension.

The invention disclosed in WO 99/52097 pertains to a communication device and a method for sensing
the movements of a speaker's lips, generating an audio signal corresponding to detected lip
movements of said speaker and transmitting said audio signal, thereby sensing a level of ambient
noise and accordingly controlling the power level of the audio signal to be transmitted.

OBJECT OF THE UNDERLYING INVENTION 

In view of the state of the art mentioned above, it is the object of the present invention to
enhance the speech/pause detection accuracy of a telephony-based voice activity detection (VAD)
system. In particular, it is the object of the invention to increase the signal-to-interference
ratio (SIR) of a recorded speech signal in crowded environments where a speaker's voice is severely
interfered by ambient noise and/or surrounding persons' voices.

The aforementioned object is achieved by means of the features in the independent claims.
Advantageous features are defined in the subordinate claims.

SUMMARY OF THE INVENTION 

The present invention is dedicated to a noise reduction and automatic speech activity recognition
system having an audio-visual user interface, wherein said system is adapted for running an
application for combining a visual feature vector $\underline{o}_{v,nT}$ that comprises features
extracted from a digital video sequence v(nT) showing a speaker's face by detecting and analyzing
e.g. lip movements and/or facial expressions of said speaker $S_i$ with an audio feature vector
$\underline{o}_{a,nT}$ which comprises features extracted from a recorded analog audio sequence
s(t). Said audio sequence s(t) thereby represents the voice of said speaker interfered by a
statistically distributed background noise

$$n'(t) := n(t) + s_{Int}(t), \qquad (1)$$

which includes both environmental noise n(t) and a weighted sum of surrounding persons' interfering
voices

$$s_{Int}(t) := \sum_{j=1,\ j \neq i}^{N} a_j \cdot s_j(t - T_j) \qquad (2a)$$

$$\text{with } a_j = \frac{1}{4\pi \cdot R_{jM}^2}\ [\mathrm{m}^{-2}] \qquad (2b)$$

in the environment of said speaker $S_i$. Thereby, N denotes the total number of speakers
(inclusive of said speaker $S_i$), $a_j$ is the attenuation factor for the interference signal
$s_j(t)$ of the j-th speaker $S_j$ in the environment of the speaker $S_i$, $T_j$ is the delay of
$s_j(t)$, and $R_{jM}$ denotes the distance between the j-th speaker and a microphone recording the
audio signal s(t).
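
The background noise model of equations (1) and (2a) can be illustrated with a small discrete-time sketch. It is not part of the specification: the sampled representation, the `interferers` container, and the function names are hypothetical, and the inverse-square form of the attenuation factor is an assumption made here for illustration.

```python
import math

def attenuation(distance_m):
    """Attenuation a_j for a speaker at distance R_jM (inverse-square law, assumed)."""
    return 1.0 / (4.0 * math.pi * distance_m ** 2)

def background_noise(env_noise, interferers, sample_rate):
    """Return n'(nT) = n(nT) + sum_j a_j * s_j(nT - T_j) as a list of samples.

    interferers: list of (samples, distance_m, delay_s) tuples, one per
    surrounding speaker j != i (hypothetical container, not from the source).
    """
    total = list(env_noise)
    for samples, distance_m, delay_s in interferers:
        a_j = attenuation(distance_m)
        shift = int(round(delay_s * sample_rate))  # delay T_j expressed in samples
        for n in range(len(total)):
            k = n - shift
            if 0 <= k < len(samples):
                total[n] += a_j * samples[k]
    return total
```

Calling `background_noise` with a silent environment and one interferer at 1 m simply yields that interferer's signal, delayed and scaled by the assumed attenuation factor.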
By tracking the lip movement of a speaker, visual features are extracted which can then be analyzed
and used for further processing. For this reason, the bimodal perceptual user interface comprises a
video camera pointing to the speaker's face for recording a digital video sequence v(nT) showing
lip movements and/or facial expressions of said speaker $S_i$, audio feature extraction and
analyzing means for determining acoustic-phonetic speech characteristics of the speaker's voice and
pronunciation based on the recorded audio sequence s(t), and visual feature extraction and
analyzing means for continuously or intermittently determining the current location of the
speaker's face, tracking lip movements and/or facial expressions of the speaker in subsequent
images and determining acoustic-phonetic speech characteristics of the speaker's voice and
pronunciation based on the detected lip movements and/or facial expressions.

According to the invention, the aforementioned extracted and analyzed visual features are fed to a
noise reduction circuit that is needed to increase the signal-to-interference ratio (SIR) of the
recorded audio signal s(t). Said noise reduction circuit is specially adapted to perform a
near-speaker detection by separating the speaker's voice from said background noise n'(t) based on
the derived acoustic-phonetic speech characteristics

$$\underline{o}_{av,nT} := [\underline{o}_{a,nT}^{T},\ \underline{o}_{v,nT}^{T}]^{T}. \qquad (3)$$

It outputs a speech activity indication signal ($\hat{s}_i(nT)$) which is obtained by a combination
of speech activity estimates supplied by said audio feature extraction and analyzing means as well
as said visual feature extraction and analyzing means.

BRIEF DESCRIPTION OF THE DRAWINGS 

Advantageous features, aspects, and useful embodiments of the invention will become evident from
the following description, the appended claims, and the accompanying drawings. Thereby,

Fig. 1  shows a noise reduction and speech activity recognition system having an audio-visual user
        interface, said system being specially adapted for running a real-time lip tracking
        application which combines visual features $\underline{o}_{v,nT}$ extracted from a digital
        video sequence v(nT) showing the face of a speaker $S_i$ by detecting and analyzing the
        speaker's lip movements and/or facial expressions with audio features
        $\underline{o}_{a,nT}$ extracted from an analog audio sequence s(t) representing the voice
        of said speaker $S_i$ interfered by a statistically distributed background noise n'(t),

Fig. 2a is a block diagram showing a conventional noise reduction and speech activity recognition
        system for a telephony-based application based on an audio speech activity estimation
        according to the state of the art,

Fig. 2b shows an example of a camera-enhanced noise reduction and speech activity recognition
        system for a telephony-based application that implements an audio-visual speech activity
        estimation algorithm according to one embodiment of the present invention,

Fig. 2c shows an example of a camera-enhanced noise reduction and speech activity recognition
        system for a telephony-based application that implements an audio-visual speech activity
        estimation algorithm according to a further embodiment of the present invention,

Fig. 3a shows a flow chart illustrating a near-end speaker detection method reducing the noise
        level of a detected analog audio sequence s(t) according to the embodiment depicted in
        Fig. 1 of the present invention,

Fig. 3b shows a flow chart illustrating a near-end speaker detection method according to the
        embodiment depicted in Fig. 2b of the present invention, and

Fig. 3c shows a flow chart illustrating a near-end speaker detection method according to the
        embodiment depicted in Fig. 2c of the present invention.

DETAILED DESCRIPTION OF THE UNDERLYING INVENTION 

In the following, different embodiments of the present invention as depicted in Figs. 1, 2b, 2c,
and 3a-c shall be explained in detail. The meaning of the symbols designated with reference
numerals and signs in Figs. 1 to 3c can be taken from an annexed table.

According to a first embodiment of the invention as depicted in Fig. 1, said noise reduction and
speech activity recognition system 100 comprises a noise reduction circuit 106 which is specially
adapted to reduce the background noise n'(t) received by a microphone 101a and to perform a
near-speaker detection by separating the speaker's voice from said background noise n'(t) as well
as a multi-channel acoustic echo cancellation unit 108 being specially adapted to perform a
near-end speaker detection and/or double-talk detection algorithm based on acoustic-phonetic speech
characteristics derived with the aid of the aforementioned audio and visual feature extraction and
analyzing means 106b and 104a+b, respectively. Thereby, said acoustic-phonetic speech
characteristics are based on the opening of a speaker's mouth as an estimate of the acoustic energy
of articulated vowels or diphthongs, respectively, rapid movement of the speaker's lips as a hint
to labial or labio-dental consonants (e.g. plosive, fricative or affricative phonemes - voiced or
unvoiced, respectively), and other statistically detected phonetic characteristics of an
association between position and movement of the lips and the voice and pronunciation of a speaker
$S_i$.
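
The two visual cues named above (mouth opening as an energy estimate, rapid lip movement as a hint to labial consonants) could be derived from a lip-tracking result roughly as follows. This is a minimal sketch under assumptions: the normalized mouth-opening track, the speed threshold, and the function name are hypothetical and are not defined in the specification.

```python
def visual_speech_cues(mouth_openings, frame_rate=15.0, speed_threshold=0.5):
    """Derive two per-frame cues from a mouth-opening track.

    mouth_openings: normalized lip aperture per video frame in [0, 1]
    (hypothetical input; the source does not define a measurement scale).
    Returns a list of (energy_estimate, rapid_lip_movement) tuples.
    """
    cues = []
    previous = mouth_openings[0] if mouth_openings else 0.0
    for opening in mouth_openings:
        # Lip speed in aperture units per second between consecutive frames.
        speed = abs(opening - previous) * frame_rate
        # Wider opening -> higher assumed acoustic energy (vowels, diphthongs).
        energy_estimate = opening
        # Fast lip motion -> hint at labial or labio-dental consonants.
        cues.append((energy_estimate, speed > speed_threshold))
        previous = opening
    return cues
```

The default frame rate follows the 15 frames/s lip-tracking rate mentioned for this embodiment; the threshold value is arbitrary and would need tuning on real data.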
The aforementioned noise reduction circuit 106 comprises digital signal processing means 106a for
calculating a discrete signal spectrum $S(k\cdot\Delta f)$ that corresponds to an
analog-to-digital-converted version s(nT) of the recorded audio sequence s(t) by performing a Fast
Fourier Transform (FFT), audio feature extraction and analyzing means 106b (e.g. an amplitude
detector) for detecting acoustic-phonetic speech characteristics of a speaker's voice and
pronunciation based on the recorded audio sequence s(t), means 106c for estimating the noise power
density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise n'(t) based on
the result of the speaker detection procedure performed by said audio feature extraction and
analyzing means 106b, a subtracting element 106d for subtracting a discretized version
$\tilde{\Phi}_{nn}(k\cdot\Delta f)$ of the estimated noise power density spectrum
$\tilde{\Phi}_{nn}(f)$ from the discrete signal spectrum $S(k\cdot\Delta f)$ of the
analog-to-digital-converted audio sequence s(nT), and digital signal processing means 106e for
calculating the corresponding discrete time-domain signal $\hat{s}_i(nT)$ of the obtained
difference signal by performing an Inverse Fast Fourier Transform (IFFT).

The depicted noise reduction and speech activity recognition system 100 comprises audio feature
extraction and analyzing means 106b which are used for determining acoustic-phonetic speech
characteristics of the speaker's voice and pronunciation ($\underline{o}_{a,nT}$) based on the
recorded audio sequence s(t) and visual feature extraction and analyzing means 104a+b for
determining the current location of the speaker's face at a data rate of 1 frame/s, tracking lip
movements and/or facial expressions of said speaker $S_i$ at a data rate of 15 frames/s and
determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based
on detected lip movements and/or facial expressions ($\underline{o}_{v,nT}$).
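
The two data rates mentioned above (face localization at 1 frame/s, lip tracking at 15 frames/s) imply a simple multi-rate schedule. The following sketch only illustrates that timing relationship; the event-list representation and the task names are hypothetical, and it assumes the lip-tracking rate is an integer multiple of the face-localization rate.

```python
def schedule_visual_tasks(duration_s, face_rate=1, lip_rate=15):
    """List (time_s, task) events for face localization and lip tracking.

    Rates default to the 1 frame/s and 15 frames/s named in the text; the
    event-list representation itself is a hypothetical illustration.
    """
    events = []
    total_lip_frames = int(duration_s * lip_rate)
    frames_per_face_update = lip_rate // face_rate
    for frame in range(total_lip_frames):
        t = frame / lip_rate
        if frame % frames_per_face_update == 0:
            # Re-locate the face once per second before tracking the lips.
            events.append((t, "locate_face"))
        events.append((t, "track_lips"))
    return events
```

Running the comparatively expensive face search only once per second and tracking the lips within the last known face region is one plausible reading of the two rates, not a statement of how the described system is implemented.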

As depicted in Fig. 1, said noise reduction system 200b/c can advantageously be used for a
video-telephony based application in a telecommunication system running on a video-enabled phone
102 which is equipped with a built-in video camera 101b' pointing at the face of a speaker $S_i$
participating in a video telephony session.

WO 2004/066273 PCT/EP2004/000104

Fig. 2b shows an example of a slow camera-enhanced noise reduction and speech activity recognition
system 200b for a telephony-based application which implements an audio-visual speech activity
estimation algorithm according to one embodiment of the present invention. Thereby, an audio speech
activity estimate taken from an audio feature vector $\underline{o}_{a,t}$ supplied by said audio
feature extraction and analyzing means 106b is correlated with a further speech activity estimate
that is obtained by calculating the difference of the discrete signal spectrum $S(k\cdot\Delta f)$
and a sampled version $\tilde{\Phi}_{nn}(k\cdot\Delta f)$ of the estimated noise power density
spectrum $\tilde{\Phi}_{nn}(f)$ of the statistically distributed background noise n'(t). Said audio
speech activity estimate is obtained by an amplitude detection of the band-pass-filtered discrete
signal spectrum $S(k\cdot\Delta f)$ of the analog-to-digital-converted audio signal s(t).

Similar to the embodiment depicted in Fig. 1, the noise reduction and speech activity recognition
system 200b depicted in Fig. 2b comprises an audio feature extraction and analyzing means 106b
(e.g. an amplitude detector) which is used for determining acoustic-phonetic speech characteristics
of the speaker's voice and pronunciation ($\underline{o}_{a,nT}$) based on the recorded audio
sequence s(t) and visual feature extraction and analyzing means 104' and 104" for determining the
current location of the speaker's face at a data rate of 1 frame/s, tracking lip movements and
facial expressions of said speaker $S_i$ at a data rate of 15 frames/s and determining
acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based on detected
lip movements and/or facial expressions ($\underline{o}_{v,nT}$). Thereby, said audio feature
extraction and analyzing means 106b can simply be realized as an amplitude detector.

Aside from the components 106a-e described above with reference to Fig. 1, the noise reduction
circuit 106 depicted in Fig. 2b comprises a delay element 204, which provides a delayed version of
the discrete signal spectrum $S(k\cdot\Delta f)$ of the analog-to-digital-converted audio signal
s(t), a first multiplier element 107a, which is used for correlating (S9) the discrete signal
spectrum $S_\tau(k\cdot\Delta f)$ of a delayed version s(nT-τ) of the analog-to-digital-converted
audio signal s(nT) with a visual speech activity estimate taken from a visual feature vector
$\underline{o}_{v,t}$ supplied by the visual feature extraction and analyzing means 104a+b and/or
104'+104", thus yielding a further estimate $\tilde{S}_i'(f)$ for updating the estimate
$\tilde{S}_i(f)$ for the frequency spectrum $S_i(f)$ corresponding to the signal $s_i(t)$ that
represents said speaker's voice as well as a further estimate $\tilde{\Phi}_{nn}'(f)$ for updating
the estimate $\tilde{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of the
statistically distributed background noise n'(t), and a second multiplier element 107, which is
used for correlating (S8a) the discrete signal spectrum $S_\tau(k\cdot\Delta f)$ of a delayed
version s(nT-τ) of the analog-to-digital-converted audio signal s(nT) with an audio speech activity
estimate obtained by an amplitude detection (S8b) of the band-pass-filtered discrete signal
spectrum $S(k\cdot\Delta f)$, thus yielding an estimate $\tilde{S}_i(f)$ for the frequency spectrum
$S_i(f)$ which corresponds to the signal $s_i(t)$ that represents said speaker's voice and an
estimate $\tilde{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of said
background noise n'(t). A sample-and-hold (S&H) element 106d' provides a sampled version
$\tilde{\Phi}_{nn}(k\cdot\Delta f)$ of the estimated noise power density spectrum
$\tilde{\Phi}_{nn}(f)$. The noise reduction circuit 106 further comprises a band-pass filter with
adjustable cut-off frequencies, which is used for filtering the discrete signal spectrum
$S(k\cdot\Delta f)$ of the analog-to-digital-converted audio signal s(t). The cut-off frequencies
can be adjusted dependent on the bandwidth of the estimated speech signal spectrum
$\tilde{S}_i(f)$. A switch 106f is provided for selectively switching between a first and a second
mode for receiving said speech signal $s_i(t)$ with and without using the proposed audio-visual
speech recognition approach providing a noise-reduced speech signal $\hat{s}_i(t)$, respectively.
According to a further aspect of the present invention, means are provided for switching said
microphone 101a off when the actual level of the speech activity indication signal
$\hat{s}_i(nT)$ falls below a predefined threshold value (not shown).

An example of a fast camera-enhanced noise reduction and speech activity recognition system 200c
for a telephony-based application which implements an audio-visual speech activity estimation
algorithm according to a further embodiment of the present invention is depicted in Fig. 2c. The
circuitry correlates a discrete signal spectrum $S(k\cdot\Delta f)$ of the
analog-to-digital-converted audio signal s(t) with a delayed version of an audio-visual speech
activity estimate and a further speech activity estimate obtained by calculating the difference
spectrum of the discrete signal spectrum $S(k\cdot\Delta f)$ and a sampled version
$\tilde{\Phi}_{nn}(k\cdot\Delta f)$ of the estimated noise power density spectrum
$\tilde{\Phi}_{nn}(f)$. The aforementioned audio-visual speech activity estimate is taken from an
audio-visual feature vector $\underline{o}_{av,t}$ obtained by combining an audio feature vector
$\underline{o}_{a,t}$ supplied by said audio feature extraction and analyzing means 106b with a
visual feature vector $\underline{o}_{v,t}$ supplied by said visual speech activity detection
module 104".

Aside from the components described above with reference to Fig. 1, the noise reduction circuit 106
depicted in Fig. 2c comprises a summation element 107c, which is used for adding (S11a) an audio
speech activity estimate supplied from an audio feature extraction and analyzing means 106b (e.g.
an amplitude detector) for determining acoustic-phonetic speech characteristics of the speaker's
voice and pronunciation ($\underline{o}_{a,nT}$) based on the recorded audio sequence s(t) to a
visual speech activity estimate supplied from visual feature extraction and analyzing means 104'
and 104" for determining the current location of the speaker's face at a data rate of 1 frame/s,
tracking lip movements and facial expressions of said speaker $S_i$ at a data rate of 15 frames/s
and determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation
based on detected lip movements and/or facial expressions ($\underline{o}_{v,nT}$), thus yielding
an audio-visual speech activity estimate. The noise reduction circuit 106 further comprises a
multiplier element 107', which is used for correlating (S11b) the discrete signal spectrum
$S(k\cdot\Delta f)$ of the analog-to-digital-converted audio signal s(t) with an audio-visual
speech activity estimate, obtained by combining an audio feature vector $\underline{o}_{a,t}$
supplied by said audio feature extraction and analyzing means 106b with a visual feature vector
$\underline{o}_{v,t}$ supplied by said visual speech activity detection module 104", thereby
yielding an estimate $\tilde{S}_i(f)$ for the frequency spectrum $S_i(f)$ which corresponds to the
signal $s_i(t)$ that represents the speaker's voice and an estimate $\tilde{\Phi}_{nn}(f)$ for the
noise power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise
n'(t). A sample-and-hold (S&H) element 106d' provides a sampled version
$\tilde{\Phi}_{nn}(k\cdot\Delta f)$ of the estimated noise power density spectrum
$\tilde{\Phi}_{nn}(f)$. The noise reduction circuit 106 further comprises a band-pass filter with
adjustable cut-off frequencies, which is used for filtering the discrete signal spectrum
$S(k\cdot\Delta f)$ of the analog-to-digital-converted audio signal s(t). Said cut-off frequencies
can be adjusted dependent on the bandwidth of the estimated speech signal spectrum
$\tilde{S}_i(f)$. A switch 106f is provided for selectively switching between a first and a second
mode for receiving said speech signal $s_i(t)$ with and without using the proposed audio-visual
speech recognition approach providing a noise-reduced speech signal $\hat{s}_i(t)$, respectively.
According to a further aspect of the present invention, said noise reduction system 200c comprises
means (SW) for switching said microphone 101a off when the actual level of the speech activity
indication signal $\hat{s}_i(nT)$ falls below a predefined threshold value (not shown).
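
The signal flow of Fig. 2c described above (summation of the audio and visual estimates, multiplication with the spectrum, microphone switch) might be sketched as follows. All names, the scaling of the sum to the range [0, 1], the use of the complementary weight for the noise estimate, and the threshold value are assumptions made for illustration; the specification does not give these details.

```python
def audio_visual_estimate(audio_activity, visual_activity):
    """S11a: add the audio and visual speech activity estimates.

    Both inputs are assumed to lie in [0, 1]; halving the sum to stay in
    that range is an illustrative choice, not taken from the source.
    """
    return 0.5 * (audio_activity + visual_activity)

def split_spectrum(spectrum, activity):
    """S11b: weight the spectrum bins by the activity estimate.

    Returns (speech_estimate, noise_estimate): the bin magnitudes scaled by
    the activity and by its complement, respectively (assumed weighting).
    """
    speech = [activity * abs(b) for b in spectrum]
    noise = [(1.0 - activity) * abs(b) for b in spectrum]
    return speech, noise

def microphone_enabled(activity, threshold=0.2):
    """Switch SW: keep the microphone on only at or above the threshold."""
    return activity >= threshold
```

A design consequence of summing before multiplying is that a confident visual estimate can keep the speech path open even when the amplitude detector is misled by loud background voices.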


A still further embodiment of the present invention is directed to a near-end speaker detection
method as shown in the flow chart depicted in Fig. 3a. Said method reduces the noise level of a
recorded analog audio sequence s(t) being interfered by a statistically distributed background
noise n'(t), said audio sequence representing the voice of a speaker $S_i$. After having subjected
(S1) the analog audio sequence s(t) to an analog-to-digital conversion, the corresponding discrete
signal spectrum $S(k\cdot\Delta f)$ of the analog-to-digital-converted audio sequence s(nT) is
calculated (S2) by performing a Fast Fourier Transform (FFT) and the voice of said speaker $S_i$ is
detected (S3) from said signal spectrum $S(k\cdot\Delta f)$ by analyzing visual features extracted
from a video sequence v(nT) which is recorded simultaneously with the analog audio sequence s(t)
and tracks the current location of the speaker's face, lip movements and/or facial expressions of
the speaker $S_i$ in subsequent images. Next, the noise power density spectrum $\Phi_{nn}(f)$ of
the statistically distributed background noise n'(t) is estimated (S4) based on the result of the
speaker detection step (S3), whereupon a sampled version $\tilde{\Phi}_{nn}(k\cdot\Delta f)$ of the
estimated noise power density spectrum $\tilde{\Phi}_{nn}(f)$ is subtracted (S5) from the discrete
spectrum $S(k\cdot\Delta f)$ of the analog-to-digital-converted audio sequence s(nT). Finally, the
corresponding discrete time-domain signal $\hat{s}_i(nT)$ of the obtained difference signal, which
represents a discrete version of the recognized speech signal, is calculated (S6) by performing an
Inverse Fast Fourier Transform (IFFT).
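
Steps S2 to S6 of the method above can be sketched frame by frame as shown below. This is an illustrative simplification, not the described implementation: a naive DFT stands in for the FFT, magnitudes are subtracted instead of a power density spectrum, negative differences are floored at zero, the smoothing factor 0.5 is arbitrary, and the per-frame `speaker_active` flags are assumed to be delivered by the visual speaker detection of step S3.

```python
import cmath

def dft(samples):
    """Naive discrete Fourier transform (stand-in for the FFT of step S2)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(bins):
    """Naive inverse transform (stand-in for the IFFT of step S6)."""
    n = len(bins)
    return [sum(bins[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def reduce_noise(frames, speaker_active):
    """Run steps S2-S6 frame by frame.

    frames: list of equal-length sample lists (output of the A/D step S1).
    speaker_active: one boolean per frame, assumed to come from the visual
    speaker detection of step S3 (lip tracking itself is not modeled here).
    """
    noise_mag = [0.0] * len(frames[0])
    output = []
    for frame, active in zip(frames, speaker_active):
        spectrum = dft(frame)                                  # S2
        if not active:                                         # S3
            # S4: update the noise estimate while the speaker is silent.
            noise_mag = [0.5 * m + 0.5 * abs(b)
                         for m, b in zip(noise_mag, spectrum)]
        cleaned = []
        for b, m in zip(spectrum, noise_mag):                  # S5
            mag = max(abs(b) - m, 0.0)  # floor at zero, keep the phase
            cleaned.append(cmath.rect(mag, cmath.phase(b)))
        output.append(idft(cleaned))                           # S6
    return output
```

Updating the noise estimate only in frames where the camera sees no lip activity is the point of the visual cue here: interfering voices from surrounding persons then end up in the noise estimate rather than being mistaken for the near speaker.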

Optionally, a multi-channel acoustic echo cancellation algorithm which models echo path impulse
responses by means of adaptive finite impulse response (FIR) filters and subtracts echo signals
from the analog audio sequence s(t) can be conducted (S7) based on acoustic-phonetic speech
characteristics derived by an algorithm for extracting visual features from a video sequence
tracking the location of a speaker's face, lip movements and/or facial expressions of the speaker
$S_i$ in subsequent images. Said multi-channel acoustic echo cancellation algorithm thereby
performs a double-talk detection procedure.
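
An adaptive FIR echo canceller with a visually driven double-talk safeguard could look roughly like the sketch below. The specification names adaptive FIR filters but no adaptation rule, so the normalized LMS (NLMS) update used here is a swapped-in, commonly used technique. The single-channel setup, the parameter values, and the per-sample `near_end_active` flags (assumed to come from lip tracking) are likewise assumptions.

```python
def nlms_echo_canceller(far_end, mic, near_end_active, taps=4, mu=0.5, eps=1e-8):
    """Cancel the far-end echo from the microphone signal, sample by sample.

    near_end_active: one boolean per sample, assumed to be supplied by the
    visual speaker detection; adaptation is frozen while it is True
    (a common double-talk safeguard, assumed here).
    Returns the echo-reduced signal.
    """
    weights = [0.0] * taps          # adaptive FIR model of the echo path
    history = [0.0] * taps          # most recent far-end samples, newest first
    output = []
    for x, d, talking in zip(far_end, mic, near_end_active):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate   # microphone signal minus estimated echo
        if not talking:
            # NLMS update, normalized by the far-end energy in the filter window.
            energy = sum(h * h for h in history) + eps
            weights = [w + mu * error * h / energy
                       for w, h in zip(weights, history)]
        output.append(error)
    return output
```

Freezing the adaptation while the near-end speaker's lips are moving prevents the filter from trying to model the speaker's own voice as echo, which is the usual failure mode during double talk.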
According to a further aspect of the invention, a learning procedure is applied which enhances the
step of detecting (S3) the voice of said speaker $S_i$ from the discrete signal spectrum
$S(k\cdot\Delta f)$ of the analog-to-digital-converted version s(nT) of an analog audio sequence
s(t) by analyzing visual features extracted from a video sequence which is recorded simultaneously
with the analog audio sequence s(t) and tracks the current location of the speaker's face, lip
movements and/or facial expressions of the speaker $S_i$ in subsequent images.

In one embodiment of the present invention, which is illustrated in the flow charts depicted in
Figs. 3a+b, a near-end speaker detection method is proposed that is characterized by the step of
correlating (S8a) the discrete signal spectrum $S_\tau(k\cdot\Delta f)$ of a delayed version
s(nT-τ) of the analog-to-digital-converted audio signal s(nT) with an audio speech activity
estimate obtained by an amplitude detection (S8b) of the band-pass-filtered discrete signal
spectrum $S(k\cdot\Delta f)$, thereby yielding an estimate $\tilde{S}_i(f)$ for the frequency
spectrum $S_i(f)$ which corresponds to the signal $s_i(t)$ representing said speaker's voice and an
estimate $\tilde{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of said
background noise n'(t). Moreover, the discrete signal spectrum $S_\tau(k\cdot\Delta f)$ of a
delayed version s(nT-τ) of the analog-to-digital-converted audio signal s(nT) is correlated (S9)
with a visual speech activity estimate taken from a visual feature vector $\underline{o}_{v,t}$
which is supplied by the visual feature extraction and analyzing means 104a+b and/or 104'+104",
thus yielding a further estimate $\tilde{S}_i'(f)$ for updating the estimate $\tilde{S}_i(f)$ for
the frequency spectrum $S_i(f)$ which corresponds to the signal $s_i(t)$ representing the speaker's
voice as well as a further estimate $\tilde{\Phi}_{nn}'(f)$ that is used for updating the estimate
$\tilde{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of the statistically
distributed background noise n'(t). The noise reduction circuit 106 thereby provides a band-pass
filter 204 for filtering the discrete signal spectrum $S(k\cdot\Delta f)$ of the
analog-to-digital-converted audio signal s(t), wherein the cut-off frequencies of said band-pass
filter 204 are adjusted (S10) dependent on the bandwidth of the estimated speech signal spectrum
$\tilde{S}_i(f)$.

In a further embodiment of the present invention as shown in the flow charts depicted in
Figs. 3a+c, a near-end speaker detection method is proposed which is characterized by the
step of adding (S11a) an audio speech activity estimate obtained by an amplitude detection
of the band-pass-filtered discrete signal spectrum S(k·Δf) of the analog-to-digital-converted
audio signal s(t) to a visual speech activity estimate taken from a visual feature vector o_v,t
supplied by said visual feature extraction and analyzing means 104a+b and/or 104'+104",
thereby yielding an audio-visual speech activity estimate. According to this embodiment,
the discrete signal spectrum S(k·Δf) is correlated (S11b) with the audio-visual speech
activity estimate, thus yielding an estimate Ŝ_i(f) for the frequency spectrum S_i(f)
corresponding to the signal s_i(t) that represents said speaker's voice as well as an estimate
Φ̃_nn(f) for the noise power density spectrum Φ_nn(f) of the statistically distributed
background noise n'(t). The cut-off frequencies of the band-pass filter 204 that is used for
filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal
s(t) are adjusted (S11c) dependent on the bandwidth of the estimated speech signal spectrum
Ŝ_i(f).
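
Steps S11a and S11b of this embodiment can be sketched as follows. The sketch is only an
illustration under stated assumptions: the amplitude detector is modeled as a threshold on
the mean magnitude inside the pass band, the addition of the two activity estimates is
modeled as a weighted sum clipped to 1, and the correlation is modeled as a per-bin
multiplication. The function names, the weights w_audio and w_visual and the threshold are
hypothetical and do not appear in the specification.

```python
def amplitude_detect(spectrum_mag, low, high, threshold):
    """Audio speech activity estimate from the band-pass-filtered spectrum (sketch)."""
    band = spectrum_mag[low:high + 1]        # bins inside the pass band
    level = sum(band) / len(band)            # mean magnitude in the band
    return 1.0 if level >= threshold else 0.0


def audio_visual_activity(audio_act, visual_act, w_audio=0.5, w_visual=0.5):
    """Step S11a (sketch): add the audio and visual estimates, clipped to 1."""
    return min(1.0, w_audio * audio_act + w_visual * visual_act)


def correlate(spectrum_mag, activity):
    """Step S11b (sketch): split each bin into a speech and a noise share."""
    speech_est = [activity * m for m in spectrum_mag]
    noise_psd_est = [(1.0 - activity) * m ** 2 for m in spectrum_mag]
    return speech_est, noise_psd_est


# Example: the audio channel indicates speech while the lips appear to be at rest.
mags = [0.0, 2.0, 4.0, 0.0]
audio_act = amplitude_detect(mags, 1, 2, threshold=1.0)
av = audio_visual_activity(audio_act, visual_act=0.0)
speech_est, noise_psd_est = correlate(mags, av)
```

With equal weights, a disagreement between the two modalities yields an intermediate
activity value, so that the energy of each bin is shared between the speech and the noise
estimate instead of being assigned entirely to one of them.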

Finally, the present invention also pertains to the use of a noise reduction system 200b/c
and a corresponding near-end speaker detection method as described above for a
video-telephony based application (e.g. a video conference) in a telecommunication system
running on a video-enabled phone having a built-in video camera 101b' pointing at the face
of a speaker S_i participating in a video telephony session. This especially pertains to a
scenario where a number of persons are sitting in one room equipped with many cameras and
microphones such that a speaker's voice interferes with the voices of the other persons.

Table: Depicted Features and Their Corresponding Reference Signs



No.


Technical Feature (System Component or Procedure Step)


100 


noise reduction and speech activity recognition system having an audio-visual user
interface, said system being specially adapted for running a real-time lip tracking application
which combines visual features o_v,nT extracted from a digital video sequence v(nT)
showing the face of a speaker S_i by detecting and analyzing the speaker's lip movements
and/or facial expressions with audio features o_a,nT extracted from an analog audio
sequence s(t) representing the voice of said speaker S_i interfered by a statistically
distributed background noise n'(t), wherein said audio sequence s(t) includes, aside from the
signal representing the voice of said speaker S_i, both environmental noise n(t) and a
weighted sum Σ_j a_j·s_j(t-T_j) (j ≠ i) of surrounding persons' interfering voices in the
environment of said speaker S_i


101a


microphone, used for recording an analog audio sequence s(t) representing the voice of a
speaker S_i interfered by a statistically distributed background noise n'(t), which includes
both environmental noise n(t) and a weighted sum Σ_j a_j·s_j(t-T_j) (with j ≠ i) of
surrounding persons' interfering voices in the environment of said speaker S_i


101a'


analog-to-digital converter (ADC), used for converting the analog audio sequence s(t)
recorded by said microphone 101a into the digital domain


101b


video camera pointing to the speaker's face for recording a video sequence showing lip
movements and/or facial expressions of said speaker S_i


101b' 


video camera as described above with an integrated analog-to-digital converter (ADC) 


102 


video telephony application, used for transmitting a video sequence showing a speaker's 
face and lip movements in subsequent images 


104 


visual front end of an automatic audio-visual speech recognition system 100 using a
bimodal approach to speech recognition and near-speaker detection by incorporating a
real-time lip tracking algorithm for deriving additional visual features from lip movements
and/or facial expressions of a speaker S_i whose voice is interfered by a statistically
distributed background noise n'(t), the visual front end 104 comprising visual feature
extraction and analyzing means for continuously or intermittently determining the
current location of the speaker's face, tracking lip movements and/or facial expressions
of the speaker S_i in subsequent images and determining acoustic-phonetic speech
characteristics of the speaker's voice and pronunciation based on detected lip movements
and/or facial expressions


104' 


visual feature extraction module for continuously tracking lip movements and/or facial
expressions of the speaker S_i and determining acoustic-phonetic speech characteristics
of the speaker's voice based on detected lip movements and/or facial expressions


104"


visual speech activity detection module for analyzing the acoustic-phonetic speech
characteristics and detecting speech activity of a speaker based on said analysis


104a 


visual feature extraction means for continuously or intermittently determining the
current location of the speaker's face recorded by a video camera 101b at a rate of 1 frame/s


104b


visual feature extraction and analyzing means for continuously tracking lip movements
and/or facial expressions of the speaker S_i and determining acoustic-phonetic speech
characteristics of said speaker's voice based on detected lip movements and/or facial
expressions at a rate of 15 frames/s


106 

noise reduction circuit being specially adapted to reduce statistically distributed
background noise n'(t) received by said microphone 101a and perform a near-speaker
detection by separating the speaker's voice from said background noise n'(t) based on a
combination of the speech characteristics which are derived by said audio and visual feature
extraction and analyzing means 104a+b and 106b, respectively


106a 


digital signal processing means for calculating the discrete signal spectrum S(k·Δf) that
corresponds to an analog-to-digital-converted version s(nT) of the recorded audio
sequence s(t) by performing a Fast Fourier Transform (FFT)


106b 


audio feature extraction and analyzing means (e.g. an amplitude detector) for detecting 
acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based 
on the recorded audio sequence s(t)


106c 


means for estimating the noise power density spectrum Φ_nn(f) of the statistically
distributed background noise n'(t) based on the result of the speaker detection procedure
performed by said audio-visual feature extraction and analyzing means 104b, 106b, 104'
and/or 104"


106c' 


means for estimating the signal spectrum S_i(f) of the recorded speech signal s_i(t) based
on the result of the speaker detection procedure performed by said audio-visual feature
extraction and analyzing means 104b, 106b, 104' and/or 104"


106d 


subtracting element for subtracting a discretized version Φ̃_nn(k·Δf) of the estimated
noise power density spectrum Φ̃_nn(f) from the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio sequence s(nT)


106d'


sample-and-hold (S&H) element providing a sampled version Φ̃_nn(k·Δf) of the
estimated noise power density spectrum Φ̃_nn(f)


106e 


digital signal processing means for calculating the corresponding discrete time-domain
signal ŝ_i(nT) of the obtained difference signal by performing an Inverse Fast Fourier
Transform (IFFT)


106f 


switch for selectively switching between a first and a second mode for receiving said
speech signal s_i(t) with and without using the proposed audio-visual speech recognition
approach providing a noise-reduced speech signal ŝ_i(t), respectively


107 


multiplier element, used for correlating the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(t) with an audio speech activity estimate which is
obtained by an amplitude detection of the digital audio signal s(nT)


107' 


multiplier element, used for correlating (S11b) the discrete signal spectrum S(k·Δf) of
the analog-to-digital-converted audio signal s(t) with an audio-visual speech activity
estimate, obtained by combining an audio feature vector o_a,t supplied by said audio feature
extraction and analyzing means 106b with a visual feature vector o_v,t supplied by said
visual speech activity detection module 104", thereby yielding an estimate Ŝ_i(f) for
the frequency spectrum S_i(f) corresponding to the signal s_i(t) which represents said
speaker's voice and an estimate Φ̃_nn(f) for the noise power density spectrum Φ_nn(f)
of the statistically distributed background noise n'(t)


107a


multiplier element, used for correlating (S9) the discrete signal spectrum S_τ(k·Δf) of a
delayed version s(nT-τ) of the analog-to-digital-converted audio signal s(nT) with a
visual speech activity estimate taken from a visual feature vector o_v,t supplied by the visual
feature extraction and analyzing means 104a+b and/or 104'+104", thereby yielding a
further estimate Ŝ_i'(f) for updating the estimate Ŝ_i(f) for the frequency spectrum
S_i(f) corresponding to the signal s_i(t) which represents said speaker's voice as well as a
further estimate Φ̃_nn'(f) for updating the estimate Φ̃_nn(f) for the noise power density
spectrum Φ_nn(f) of the statistically distributed background noise n'(t)


107b 


multiplier element, used for correlating (S8a) the discrete signal spectrum S_τ(k·Δf) of a
delayed version s(nT-τ) of the analog-to-digital-converted audio signal s(nT) with an
audio speech activity estimate obtained by an amplitude detection (S8b) of the
band-pass-filtered discrete signal spectrum S(k·Δf), thereby yielding an estimate Ŝ_i(f) for the
frequency spectrum S_i(f) corresponding to the signal s_i(t) which represents said speaker's
voice as well as an estimate Φ̃_nn(f) for the noise power density spectrum Φ_nn(f) of
the statistically distributed background noise n'(t)


107c


summation element, used for adding (S11a) the audio speech activity estimate to the
visual speech activity estimate, thereby yielding an audio-visual speech activity estimate


108 


multi-channel acoustic echo cancellation unit being specially adapted to perform a near- 
end speech detection and/or double-talk detection algorithm based on acoustic-phonetic 
speech characteristics derived by said audio and visual feature extraction and analyzing 
means 104a+b and 106b, respectively 


108a 


means for near-end talk and/or double-talk detection, integrated in the multi-channel 
acoustic echo cancellation unit 108 


200a 


block diagram showing a conventional noise reduction and speech activity recognition
system for a telephony-based application based on an audio speech activity estimation
according to the state of the art, wherein the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(t) is correlated with an audio speech activity
estimate which is obtained by an amplitude detection of the digital audio signal s(nT)


200b 


block diagram showing an example of a slow camera-enhanced noise reduction and
speech activity recognition system for a telephony-based application implementing an
audio-visual speech activity estimation algorithm according to one embodiment of the
present invention, wherein the discrete signal spectrum S_τ(k·Δf) of a delayed version
s(nT-τ) of the analog-to-digital-converted audio signal s(nT) is correlated (S8a) with an
audio speech activity estimate obtained by an amplitude detection (S8b) of the
band-pass-filtered discrete signal spectrum S(k·Δf), thereby yielding an estimate Ŝ_i(f) for the
frequency spectrum S_i(f) corresponding to the signal s_i(t) which represents said speaker's
voice and an estimate Φ̃_nn(f) for the noise power density spectrum Φ_nn(f) of the
statistically distributed background noise n'(t), and also correlated (S9) with a visual
speech activity estimate taken from a visual feature vector o_v,t supplied by the visual
feature extraction and analyzing means 104a+b and/or 104'+104", thereby yielding a
further estimate Ŝ_i'(f) for updating the estimate Ŝ_i(f) for the frequency spectrum
S_i(f) corresponding to the signal s_i(t) which represents said speaker's voice as well as a
further estimate Φ̃_nn'(f) for updating the estimate Φ̃_nn(f) for the noise power density
spectrum Φ_nn(f) of the statistically distributed background noise n'(t)


200c 


block diagram showing an example of a fast camera-enhanced noise reduction and
speech activity recognition system for a telephony-based application implementing an
audio-visual speech activity estimation algorithm according to a further embodiment of
the present invention, wherein the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(t) is correlated (S11b) with an audio-visual speech
activity estimate, obtained by combining an audio feature vector o_a,t which is supplied by
said audio feature extraction and analyzing means 106b with a visual feature vector o_v,t
supplied by the visual speech activity detection module 104", thereby yielding an
estimate Ŝ_i(f) for the corresponding frequency spectrum S_i(f) of the signal s_i(t) which
represents said speaker's voice as well as an estimate Φ̃_nn(f) for the noise power density
spectrum Φ_nn(f) of the statistically distributed background noise n'(t)


202 


delay element, providing a delayed version of the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(t)


204


band-pass filter with adjustable cut-off frequencies which can be adjusted dependent on
the bandwidth of the estimated speech signal spectrum Ŝ_i(f), used for filtering the
discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t)


300a 


flow chart illustrating a near-end speaker detection method reducing the noise level of a
detected analog audio sequence s(t) according to the embodiment depicted in Fig. 1 of
the present invention


300b 


flow chart illustrating a near-end speaker detection method according to the embodiment 
depicted in Fig. 2b of the present invention 


300c 


flow chart illustrating a near-end speaker detection method according to the embodiment
depicted in Fig. 2c of the present invention


SW 


means for switching said microphone 101a off when the actual level of the speech
activity indication signal ŝ_i(nT) falls below a predefined threshold value (not shown)


S1


step #1: subjecting the analog audio sequence s(t) to an analog-to-digital conversion


S10


step #10: adjusting the cut-off frequencies of the band-pass filter 204 used for filtering
the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t)
dependent on the bandwidth of the estimated speech signal spectrum Ŝ_i(f)


S11a


step #11a: adding an audio speech activity estimate which is obtained by an amplitude
detection of the band-pass-filtered discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(t) to a visual speech activity estimate taken from a
visual feature vector o_v,t supplied by the visual feature extraction and analyzing means
104a+b and/or 104'+104", thereby yielding an audio-visual speech activity estimate


S11b


step #11b: correlating the discrete signal spectrum S(k·Δf) with the audio-visual speech
activity estimate, thereby yielding an estimate Ŝ_i(f) for the frequency spectrum S_i(f)
corresponding to the signal s_i(t) which represents said speaker's voice as well as an
estimate Φ̃_nn(f) for said noise power density spectrum Φ_nn(f)


S11c


step #11c: adjusting the cut-off frequencies of a band-pass filter 204 used for filtering
the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t)
dependent on the bandwidth of the estimated speech signal spectrum Ŝ_i(f)


S2


step #2: calculating the corresponding discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio sequence s(nT) by performing a Fast Fourier Transform (FFT)


S3 


step #3: detecting the voice of said speaker S_i from said signal spectrum S(k·Δf) by
analyzing visual features extracted from a video sequence recorded simultaneously with
the audio sequence, for tracking the current location of the speaker's face, lip
movements and/or facial expressions of the speaker S_i in subsequent images


S4


step #4: estimating the noise power density spectrum Φ_nn(f) of the statistically
distributed background noise n'(t) based on the result of the speaker detection step S3


S5


step #5: subtracting a discretized version Φ̃_nn(k·Δf) of the estimated noise power
density spectrum Φ̃_nn(f) from the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio sequence s(nT)


S6


step #6: calculating the corresponding discrete time-domain signal ŝ_i(nT) of the
obtained difference signal by performing an Inverse Fast Fourier Transform (IFFT)


S7 


step #7: conducting a multi-channel acoustic echo cancellation algorithm which models
echo path impulse responses by means of adaptive finite impulse response (FIR) filters
and subtracts echo signals from the analog audio sequence s(t) based on
acoustic-phonetic speech characteristics derived by an algorithm for extracting visual features
from a video sequence tracking the location of a speaker's face, lip movements and/or
facial expressions of the speaker S_i in subsequent images


S8o


step #8o: band-pass-filtering the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(nT)


S8a


step #8a: correlating the discrete signal spectrum S_τ(k·Δf) of a delayed version s(nT-τ) of
the analog-to-digital-converted audio signal s(nT) with an audio speech activity estimate
obtained by the amplitude detection step S8b


S8b


step #8b: amplitude detection of the band-pass-filtered discrete signal spectrum S(k·Δf),
thereby yielding an estimate Ŝ_i(f) for the frequency spectrum S_i(f) corresponding to the
signal s_i(t) which represents said speaker's voice as well as an estimate Φ̃_nn(f) for the
noise power density spectrum Φ_nn(f)


S9


step #9: correlating the discrete signal spectrum S_τ(k·Δf) of a delayed version s(nT-τ) of
the analog-to-digital-converted audio signal s(nT) with a visual speech activity estimate
taken from a visual feature vector o_v,t supplied by the visual feature extraction and
analyzing means 104a+b and/or 104'+104", thereby yielding a further estimate Ŝ_i'(f) for
updating the estimate Ŝ_i(f) for the frequency spectrum S_i(f) corresponding to the signal
s_i(t) which represents said speaker's voice as well as a further estimate Φ̃_nn'(f) for
updating the estimate Φ̃_nn(f) for the noise power density spectrum Φ_nn(f) of the
statistically distributed background noise n'(t)


S10


step #10: adjusting the cut-off frequencies of a band-pass filter 204 used for filtering the
discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t)
dependent on the bandwidth of the estimated speech signal spectrum Ŝ_i(f)


S11a


step #11a: adding an audio speech activity estimate obtained by an amplitude detection
of the band-pass-filtered discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(t) to a visual speech activity estimate taken from a visual feature
vector o_v,t supplied by said visual feature extraction and analyzing means 104a+b and/or
104'+104", thereby yielding an audio-visual speech activity estimate


S11b


step #11b: correlating the discrete signal spectrum S(k·Δf) with the audio-visual speech
activity estimate, thus yielding an estimate Ŝ_i(f) for the frequency spectrum S_i(f)
corresponding to the signal s_i(t) which represents said speaker's voice as well as an
estimate Φ̃_nn(f) for the noise power density spectrum Φ_nn(f) of the statistically
distributed background noise n'(t)


S11c


step #11c: adjusting the cut-off frequencies of a band-pass filter 204 used for filtering
the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t)
dependent on the bandwidth of the estimated speech signal spectrum Ŝ_i(f)


Claims

1. A noise reduction system with an audio-visual user interface, said system being specially
adapted for running an application for combining visual features (o_v,nT) extracted from a
digital video sequence (v(nT)) showing the face of a speaker (S_i) with audio features (o_a,nT)
extracted from an analog audio sequence (s(t)), wherein said audio sequence (s(t)) can
include noise in the environment of said speaker (S_i), said noise reduction system (200b/c)
comprising

- means (101a, 106b) for detecting and analyzing said analog audio sequence (s(t)),
- means (101b') for detecting said video sequence (v(nT)), and
- means (104a+b, 104'+104") for analyzing the detected video signal (v(nT)),

characterized by

a noise reduction circuit (106) being adapted to separate the speaker's voice from said
background noise (n'(t)) based on a combination of derived speech characteristics
(o_av,nT := [o_a,nT^T, o_v,nT^T]^T) and outputting a speech activity indication signal
(ŝ_i(nT)) which is obtained by a combination of speech activity estimates supplied by said
analyzing means (106b, 104a+b, 104'+104").

2. A noise reduction system according to claim 1,
characterized by

means (SW) for switching off an audio channel in case the actual level of said speech
activity indication signal (ŝ_i(nT)) falls below a predefined threshold value.

3. A noise reduction system according to any one of the claims 1 or 2,
characterized by

a multi-channel acoustic echo cancellation unit (108) being specially adapted to perform a
near-end speaker detection and double-talk detection algorithm based on acoustic-phonetic
speech characteristics derived by said audio feature extraction and analyzing means (106b)
and said visual feature extraction and analyzing means (104a+b, 104'+104").

4. A noise reduction system according to any one of the claims 1 to 3,
characterized in that

said audio feature extraction and analyzing means (106b) is an amplitude detector.

5. A near-end speaker detection method reducing the noise level of a detected analog audio
sequence (s(t)),

said method being characterized by the following steps:

- subjecting (S1) said analog audio sequence (s(t)) to an analog-to-digital conversion,
- calculating (S2) the corresponding discrete signal spectrum (S(k·Δf)) of the
  analog-to-digital-converted audio sequence (s(nT)) by performing a Fast Fourier Transform (FFT),
- detecting (S3) the voice of said speaker (S_i) from said signal spectrum (S(k·Δf)) by
  analyzing visual features (o_v,nT) extracted from a video sequence (v(nT)) recorded
  simultaneously with the analog audio sequence (s(t)), tracking the current location of
  the speaker's face, lip movements and/or facial expressions of the speaker (S_i) in
  subsequent images,
- estimating (S4) the noise power density spectrum (Φ_nn(f)) of the statistically
  distributed background noise (n'(t)) based on the result of the speaker detection step (S3),
- subtracting (S5) a discretized version (Φ̃_nn(k·Δf)) of the estimated noise power
  density spectrum (Φ̃_nn(f)) from the discrete signal spectrum (S(k·Δf)) of the
  analog-to-digital-converted audio sequence (s(nT)), and
- calculating (S6) the corresponding discrete time-domain signal (ŝ_i(nT)) of the obtained
  difference signal by performing an Inverse Fast Fourier Transform (IFFT), thereby
  yielding a discrete version of the recognized speech signal.

6. A near-end speaker detection method according to claim 5,
characterized by the step of

conducting (S7) a multi-channel acoustic echo cancellation algorithm which models echo
path impulse responses by means of adaptive finite impulse response (FIR) filters and
subtracts echo signals from the analog audio sequence (s(t)) based on acoustic-phonetic speech
characteristics derived by an algorithm for extracting visual features (o_v,nT) from a video
sequence (v(nT)) tracking the location of a speaker's face, lip movements and/or facial
expressions of the speaker (S_i) in subsequent images.

7. A near-end speaker detection method according to claim 6,
characterized in that

said multi-channel acoustic echo cancellation algorithm performs a double-talk detection
procedure.

8. A near-end speaker detection method according to any one of the claims 5 to 7,
characterized in that

said acoustic-phonetic speech characteristics are based on the opening of a speaker's mouth
as an estimate of the acoustic energy of articulated vowels or diphthongs, respectively,
rapid movement of the speaker's lips as a hint to labial or labio-dental consonants,
respectively, and other statistically detected phonetic characteristics of an association between
position and movement of the lips and the voice and pronunciation of said speaker (S_i).

9. A near-end speaker detection method according to any one of the claims 5 to 8,
characterized by

a learning procedure used for enhancing the step of detecting (S3) the voice of said speaker
(S_i) from the discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted version
(s(nT)) of an analog audio sequence (s(t)) by analyzing visual features (o_v,nT) extracted
from a video sequence (v(nT)) recorded simultaneously with the analog audio sequence
(s(t)), tracking the current location of the speaker's face, lip movements and/or facial
expressions of the speaker (S_i) in subsequent images.

10. A near-end speaker detection method according to any one of the claims 5 to 9,
characterized by the step of

correlating (S8a) the discrete signal spectrum (S_τ(k·Δf)) of a delayed version (s(nT-τ)) of the
analog-to-digital-converted audio signal (s(nT)) with an audio speech activity estimate
obtained by an amplitude detection (S8b) of the band-pass-filtered discrete signal spectrum
(S(k·Δf)), thereby yielding an estimate (Ŝ_i(f)) for the frequency spectrum (S_i(f))
corresponding to the signal (s_i(t)) which represents said speaker's voice as well as an estimate
(Φ̃_nn(f)) for the noise power density spectrum (Φ_nn(f)) of the statistically distributed
background noise (n'(t)).

11. A near-end speaker detection method according to claim 10,
characterized by the step of

correlating (S9) the discrete signal spectrum (S_τ(k·Δf)) of a delayed version (s(nT-τ)) of the
analog-to-digital-converted audio signal (s(nT)) with a visual speech activity estimate taken
from a visual feature vector (o_v,t) supplied by the visual feature extraction and analyzing
means (104a+b, 104'+104"), thereby yielding a further estimate (Ŝ_i'(f)) for updating the
estimate (Ŝ_i(f)) for the frequency spectrum (S_i(f)) corresponding to the signal (s_i(t))
which represents said speaker's voice as well as a further estimate (Φ̃_nn'(f)) for updating
the estimate (Φ̃_nn(f)) for the noise power density spectrum (Φ_nn(f)) of the statistically
distributed background noise (n'(t)).

12. A near-end speaker detection method according to any one of the claims 10 or 11,
characterized by the step of

adjusting (S10) the cut-off frequencies of a band-pass filter (204) used for filtering the
discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted audio signal (s(t))
dependent on the bandwidth of the estimated speech signal spectrum (Ŝ_i(f)).

13. A near-end speaker detection method according to any one of the claims 5 to 9,
characterized by the steps of

- adding (S11a) an audio speech activity estimate obtained by an amplitude detection of
  the band-pass-filtered discrete signal spectrum (S(k·Δf)) of the
  analog-to-digital-converted audio signal (s(t)) to a visual speech activity estimate taken from a visual
  feature vector (o_v,t) supplied by said visual feature extraction and analyzing means
  (104a+b, 104'+104"), thereby yielding an audio-visual speech activity estimate,
- correlating (S11b) the discrete signal spectrum (S(k·Δf)) with the audio-visual speech
  activity estimate, thereby yielding an estimate (Ŝ_i(f)) for the frequency spectrum (S_i(f))
  corresponding to the signal (s_i(t)) which represents said speaker's voice as well as an
  estimate (Φ̃_nn(f)) for the noise power density spectrum (Φ_nn(f)) of the statistically
  distributed background noise (n'(t)), and
- adjusting (S11c) the cut-off frequencies of a band-pass filter (204) used for filtering the
  discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted audio signal (s(t))
  dependent on the bandwidth of the estimated speech signal spectrum (Ŝ_i(f)).

14. Use of a noise reduction system (200b/c) according to any one of the claims 1 to 4 and a
near-end speaker detection method according to any one of the claims 5 to 13 for a
video-telephony based application in a telecommunication system running on a video-enabled
phone with a built-in video camera (101b') pointing at the face of a speaker (S_i)
participating in a video telephony session.

15. A telecommunication device equipped with an audio-visual user interface,
characterized by

a noise reduction system (200b/c) according to any one of the claims 1 to 4.

[Flow chart: near-end speaker detection method, steps S1 to S6]

Recording an analog audio sequence s(t) representing the voice of a speaker S_i
interfered by statistically distributed background noise n'(t), said audio sequence
including both environmental noise n(t) and a weighted sum of surrounding
persons' interfering voices in the environment of said speaker S_i

Recording a digital video sequence v(nT) showing the face of said speaker S_i for
detecting and analyzing said speaker's lip movements and/or facial expressions

S1: Subjecting said analog audio sequence s(t) to an analog-to-digital conversion

S2: Calculating the corresponding discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio sequence s(nT) by performing a Fast Fourier Transform (FFT)

S3: Detecting the voice of said speaker S_i from said signal spectrum S(k·Δf) by
analyzing visual features o_v,nT extracted from the video sequence v(nT) tracking said
lip movements and/or facial expressions

S4: Estimating the noise power density spectrum Φ_nn(f) of the statistically distributed
background noise n'(t) based on the result of the speaker detection step (S3)

S5: Subtracting a discretized version Φ̃_nn(k·Δf) of the estimated noise power density
spectrum Φ̃_nn(f) from the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio sequence s(nT)

S6: Calculating the corresponding discrete time-domain signal ŝ_i(nT) of the obtained
difference signal by performing an Inverse Fourier Transform (IFFT), thereby
yielding a discrete version of the recognized speech signal
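
The chain of steps S2 to S6 can be illustrated with a short, self-contained sketch. It is a
simplified reading of the method rather than the claimed implementation: the spectrum is
computed with a naive discrete Fourier transform instead of an FFT, the noise power density
spectrum is assumed to be the average power of frames that the visual detector marks as
non-speech, and the subtraction is assumed to act on the power of each bin while keeping the
phase of the noisy signal. The function names and the frame-wise speaking flags are
hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (stands in for the FFT of step S2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse transform (stands in for the IFFT of step S6), real part only."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def estimate_noise_psd(frames, speaking):
    """Steps S3/S4 (sketch): average power of frames flagged as non-speech."""
    silent = [dft(f) for f, s in zip(frames, speaking) if not s]
    if not silent:
        raise ValueError("need at least one non-speech frame")
    n = len(silent[0])
    return [sum(abs(spec[k]) ** 2 for spec in silent) / len(silent) for k in range(n)]


def spectral_subtract(frame, noise_psd):
    """Steps S2, S5 and S6 (sketch): subtract the noise power per bin, keep the phase."""
    cleaned = []
    for s_k, n_k in zip(dft(frame), noise_psd):
        power = max(abs(s_k) ** 2 - n_k, 0.0)          # floor at zero after subtraction
        cleaned.append(cmath.rect(math.sqrt(power), cmath.phase(s_k)))
    return idft(cleaned)


# Example: a one-cycle sine ("speech") disturbed by an alternating-sign noise pattern.
n = 8
noise = [0.5 * (-1) ** t for t in range(n)]
speech = [math.sin(2 * math.pi * t / n) for t in range(n)]
noisy = [s + v for s, v in zip(speech, noise)]
noise_psd = estimate_noise_psd([noise, noisy], speaking=[False, True])
cleaned = spectral_subtract(noisy, noise_psd)
```

In this toy case speech and noise occupy different frequency bins, so the subtraction
recovers the speech frame almost exactly; with overlapping spectra the result would only be
an approximation.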

300b

S8o: Band-pass-filtering the discrete signal spectrum S(k·Δf)

S8b: Performing an amplitude detection of the band-pass-filtered discrete
signal spectrum S(k·Δf)

S8a: Correlating the discrete signal spectrum S_τ(k·Δf) of the delayed
version s(nT-τ) of the analog-to-digital-converted audio signal s(nT)
with an audio speech activity estimate obtained by said amplitude
detection step S8b, thereby yielding an estimate Ŝ_i(f) for the frequency
spectrum S_i(f) corresponding to the signal s_i(t) which represents said
speaker's voice as well as an estimate Φ̃_nn(f) for the noise power
density spectrum Φ_nn(f) of the statistically distributed background noise n'(t)

S9: Correlating the discrete signal spectrum S_τ(k·Δf) with a visual speech
activity estimate taken from a visual feature vector o_v,t supplied by
the visual feature extraction and analyzing means 104a+b and/or
104'+104", thereby yielding a further estimate Ŝ_i'(f) for updating the
estimate Ŝ_i(f) for the frequency spectrum S_i(f) as well as a further estimate
Φ̃_nn'(f) for updating the estimate Φ̃_nn(f)

S10: Adjusting the cut-off frequencies of a band-pass filter 204 used for
filtering the discrete signal spectrum S(k·Δf) dependent on the
bandwidth of the estimated speech signal spectrum Ŝ_i(f)

300c

Band-pass-filtering the discrete signal spectrum S(k·Δf)

Performing an amplitude detection of the band-pass-filtered discrete
signal spectrum S(k·Δf)

S11a: Adding an audio speech activity estimate obtained by the amplitude
detection step S8b to a visual speech activity estimate taken from a
visual feature vector o_v,t supplied by said visual feature extraction and
analyzing means 104a+b and/or 104'+104", thereby yielding an
audio-visual speech activity estimate

S11b: Correlating the discrete signal spectrum S(k·Δf) with the audio-visual
speech activity estimate, thereby yielding an estimate Ŝ_i(f) for the
frequency spectrum S_i(f) corresponding to the signal s_i(t) which represents
said speaker's voice as well as an estimate Φ̃_nn(f) for the noise power
density spectrum of the statistically distributed background noise n'(t)

S11c: Adjusting the cut-off frequencies of a band-pass filter 204 used for
filtering the discrete signal spectrum S(k·Δf) dependent on the
bandwidth of the estimated speech signal spectrum Ŝ_i(f)